Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clinton/ed 448/sdk html and text #820

Merged
merged 36 commits into from
Jan 13, 2025

Conversation

clinton-encord
Copy link
Contributor

Introduction and Explanation

Add object and classification support for html-text

Some things left to do:

  • Add object and classification support for plain-text (although this may already be done due to html text being done, just haven't checked or added tests for it)
  • When calling ClassificationInstance.set_for_frames on a non-geometric label row (i.e. audio or text), validate the ranges
    • Audio range should be over entire file
    • Text range should be [0, 0]

JIRA

Link ticket(s)

Documentation

There should be enough internal documentation for a product owner to write customer-facing documentation or a separate PR linked if writing the customer documentation directly. Link all that are relevant below.

  • Internal: notion link
  • Customer docs PR: link
  • OpenAPI/SDK
    • Generated docs: link to example if possible
    • Command to generate: here

Tests

Make a quick statement and post any relevant links of CI / test results. If the testing infrastructure isn’t yet in-place, note that instead.

  • What are the critical unit tests?
  • Explain the Integration Tests such that it’s clear Correctness is satisfied. Link to test results if possible.

Known issues

If there are any known issues with the solution, make a statement about what they are and why they are Ok to leave unsolved for now. Make tickets for the known issues linked to the original ticket linked above

Copy link

github-actions bot commented Dec 11, 2024

Unit test report (Pydantic 1.x)

206 tests   206 ✅  5s ⏱️
  1 suites    0 💤
  1 files      0 ❌

Results for commit 32348fb.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Dec 11, 2024

Unit test report ((Pydantic 2.x)

206 tests   206 ✅  6s ⏱️
  1 suites    0 💤
  1 files      0 ❌

Results for commit 32348fb.

♻️ This comment has been updated with latest results.

Comment on lines -2074 to +2134
file_type=label_row_dict.get("file_type", None),
file_type=file_type,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the longest time, this has just been None lolol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we should be able to get the file_type on first requesting the label_row and instead of having to be included in the blob when we initialise labels:
i.e: The above method is called in lr.initialise_labels().
Probably worth checking whether we get the file_type before and we just don't provide it in the label blob and then here it gets overridden by the empty field in the actual label blob. Pretty confident that this is the case, checked. Let me know if this comment is obtuse cause this is a bad change somewhat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes your comment is correct. I'm unsure as to what you want me to do though.
Are you trying to suggest that we don't overwrite it? And keep what we have in file_type before calling initialise_labels?

Comment on lines 481 to 489
if range_html is not None:
if self._ontology_object.shape != Shape.TEXT:
raise LabelRowError(
"Setting range_html of the object instance is only allowed for objects with the 'text' shape"
)
elif self._parent is not None and self._parent.file_type != "text/html":
raise LabelRowError("Cannot add range_html to a non-html text file")
elif len(self.range_list) > 0:
raise LabelRowError("Cannot add range_html to an object instance that has a non-html range")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An important set of checks to ensure range_html is used properly on an ObjectInstance.

Comment on lines 498 to 501
if len(range_html) > 0:
self._range_html = range_html
else:
self._range_html = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows a user to REMOVE the range_html, by setting it to an empty array.

Copy link

github-actions bot commented Dec 12, 2024

SDK integration test report

284 tests  ±0   277 ✅ +1   19m 25s ⏱️ -17s
  1 suites ±0     7 💤 ±0 
  1 files   ±0     0 ❌  - 1 

Results for commit ea71694. ± Comparison against base commit 0f9c697.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@Jim-Encord Jim-Encord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of comments. Please take care that you aren't breaking historical backwards compatiblity around the file type changes.

encord/objects/coordinates.py Show resolved Hide resolved
annotation = obj.get_annotations()[0]
ret[obj.object_hash]["range_html"] = [x.to_dict() for x in obj.range_html]
ret[obj.object_hash]["range"] = []
ret[obj.object_hash]["createdBy"] = annotation.created_by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style:
Why so much C&P. Validation of non-type specific objects should be in a unified way in a unified place. i.e: Do createdBy stuff first and then use if statements to validate other fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha on C&P.

I don't understand the validation stuff.

elif self.data_type == DataType.PLAIN_TEXT or self.data_type == DataType.PDF:
pass
elif (
self.data_type == DataType.IMAGE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style:
Could use your introduced geometric_types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't work with exhaustive guard.


ret["labels"] = self._to_encord_labels(frame_level_data)

if self._label_row_read_only_data.duration is not None:
if self._label_row_read_only_data.duration is not None and self.data_type != DataType.PLAIN_TEXT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this condition?
Can we not ensure that for data_type PLAIN_TEXT, that we don't provide a duration.

It's not entirely clear to me why we need the original branching condition given that it seems fine to propagate the None into the ret dictionary. Maybe this is just someone else's ugly serialisation code.

Copy link
Contributor Author

@clinton-encord clinton-encord Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea you're right about the original branching condition. Ideally, we only have to check the data_type before adding properties to the data_unit, and not have to check whether the value from _label_row_read_only_data is None or not.

Maybe there was a historical reason for this? If no discernible reason, will clean up this function (albeit probably in a separate PR to reduce surface area)

@sergei-encord any thoughts on this?

encord/objects/ontology_labels_impl.py Outdated Show resolved Hide resolved
Comment on lines -2074 to +2134
file_type=label_row_dict.get("file_type", None),
file_type=file_type,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we should be able to get the file_type on first requesting the label_row and instead of having to be included in the blob when we initialise labels:
i.e: The above method is called in lr.initialise_labels().
Probably worth checking whether we get the file_type before and we just don't provide it in the label blob and then here it gets overridden by the empty field in the actual label blob. Pretty confident that this is the case, checked. Let me know if this comment is obtuse cause this is a bad change somewhat.

encord/objects/ontology_object_instance.py Outdated Show resolved Hide resolved
if "createdAt" in d and d["createdAt"] is not None:
created_at = parse_datetime(d["createdAt"])
else:
created_at = datetime.now()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😮
This doesn't make sense to me.
On creation yes, we should be providing a created_at timestamp. But I strongly disagree that on fail find, we should fallback to providing now as the created_at. I'd rather have None or the Unix Epoch time than now.

If you are using this branch as a mechanism somehow of updating created_at, if this is the first time creating labels, then I would reconsider refactoring such that it doesn't have this risky side effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah that is very correct! I put this in at the start just to get it working. CreatedAt should always be there, no defaults required.

label_row.add_classification_instance(classification_instance)
assert len(label_row.get_classification_instances()) == 1
classification_instance.set_for_frames(Range(start=2000, end=2499))
range_list = classification_instance.range_list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may just be personal:
This shouldn't be an ordered list imo, unless we are sorting it on requesting to ensure that we sort by the start index (If you are doing that then great!). The actual notion of classifications as you might view them in the label editor has the chronological ordering or here the lexicographic but if we are guaranteeing a stable order, it really needs to be derived from that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't quite understand this. ChatGPT suggests that:

They are questioning whether range_list is inherently or intentionally ordered. If it's being treated as an ordered list, they are asking whether you are explicitly sorting it by a meaningful criterion (e.g., the start index of ranges).

If that is the case, then yes, range_list is being sorted by start index of ranges. See the add_range function in encord/common/range_manager.py.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, yeah I was asking whether it has a stable ordering

encord/objects/classification_instance.py Outdated Show resolved Hide resolved
encord/objects/html_node.py Outdated Show resolved Hide resolved
offset (int): The offset of the content from the xpath
"""

node: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should call this xpath? ATM node is already in the class name HtmlNode, so feels like the field name can be more descriptive

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the actual range_html in the objects_index, we call it "node" though.

Worried it might be confusing to have it called "node" in the exports and in our DB, but then have it called "xpath" here in this one place.

encord/objects/html_node.py Outdated Show resolved Hide resolved
tests/objects/test_label_structure.py Outdated Show resolved Hide resolved
@clinton-encord clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from fc26087 to 92ec53a Compare December 30, 2024 09:15
@clinton-encord clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from 6fac349 to 1084f6f Compare December 31, 2024 09:27
@arthur-encord arthur-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from 51454cc to b6f0245 Compare January 2, 2025 09:01
range (Ranges): Ranges of chars for simple text files
"""

range_html: Optional[List[HtmlRange]] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be plural? ranges and ranges_html (and same for Audio above? could be fine to change since no one is using it yet I assume)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap agreed, but in the label format, we just call it range though. So using the same here for consistency.

@@ -31,7 +31,7 @@

from encord.common.range_manager import RangeManager
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable to expect label_row.add_classification_instance(cls_instance) to work for text/html/audio without having to call cls_instance.set_for_frames, given that we only support global classifications for these?

return f"{self.start.node}-{self.start.offset}-{self.end.node}-{self.end.offset}"

@classmethod
def from_dict(cls, d: dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use this method to be able to construct TextCoordinates.range_html from a dict rather than having to import HtmlRange & HtmlNode?
No big deal though, it's fine as is

range_html: Optional[List[HtmlRange]] = None
range: Optional[Ranges] = None

def __post_init__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could separate TextCoordinates and HtmlCoordinates to not have to do the logic below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes makes sense!

"""

range_html: Optional[List[HtmlRange]] = None
range: Optional[Ranges] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be nice for the text ranges to be able to be a list of int as well?

@clinton-encord clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from b6f0245 to 5c18d7a Compare January 2, 2025 14:20
@clinton-encord clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from dbcb6c3 to 3ee74fc Compare January 6, 2025 05:27
@clinton-encord clinton-encord force-pushed the clinton/ed-448/sdk-html-and-text branch from f14e19b to ea71694 Compare January 13, 2025 08:28
@clinton-encord clinton-encord merged commit fcb200a into master Jan 13, 2025
5 of 6 checks passed
@clinton-encord clinton-encord deleted the clinton/ed-448/sdk-html-and-text branch January 13, 2025 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants